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Abstract 

Inspired by ideas taken from the machine learning literature, new regularization techniques have been recently introduced in linear system 
identihcation. In particular, all the adopted estimators solve a regularized least squares problem, differing in the nature of the penalty term 
assigned to the impulse response. Popular choices include atomic and nuclear norms (applied to Hankel matrices) as well as norms induced 
by the so called stable spline kernels. In this paper, a comparative study of estimators based on these different types of regularizers is 
reported. Our findings reveal that stable spline kernels outperform approaches based on atomic and nuclear norms since they suitably embed 
information on impulse response stability and smoothness. This point is illustrated using the Bayesian interpretation of regularization. 
We also design a new class of regularizers defined by “integral” versions of stable spline/TC kernels. Under quite realistic experimental 
conditions, the new estimators outperform classical prediction error methods also when the latter are equipped with an oracle for model 
order selection. 
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1 Introduction 

Prediction error methods (PEM) are a classical tool for re¬ 
constructing the impulse response of a linear system starting 
from input-output measurements [30]. In the simplest sce¬ 
nario, one postulates a single model structure, e.g. a rational 
transfer function which depends on an unknown parameter 
vector g of known dimension. If the model contains the 
“true” system, and the noise source is Gaussian, estimation 
of g by PEM enjoys optimal asymptotic properties. In par¬ 
ticular, this estimator can not be outperformed by any other 
unbiased estimator as the number of measurements goes to 
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infinity. 

However, in real applications, not only the number of mea¬ 
surements is always finite but also model complexity is 
typically unknown. This means that different model struc¬ 
tures need to be introduced, e.g. rational transfer functions 
of different order. Each of them has to be fitted to data 
by PEM and then compared resorting e.g. to validation 
techniques (cross validation) or complexity criteria such 
as Akaike’s criterion (AIC) [3, 18]. Recent studies have 
however illustrated some limitations of this approach. Eor 
instance, when the data set size and/or the signal to noise 
ratio is not so large, it can return models with a non satis¬ 
factory prediction power on new data [39], 


The above issues have motivated the development of an al¬ 
ternative route to system identification based on regulariza¬ 
tion techniques. The starting point is the use of high-order 
EIR models in combination with penalty terms (regulariz¬ 
ers) on the impulse response g. In particular, an important 
class of estimators solves a convex optimization problem of 
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the form 

argmin V (g) + rJ(g), 7 > 0. 

g 

Above, V is the so-called loss function that depends also on 
the input-output measurements and measures the adherence 
to experimental data. It is often given by the sum of squared 
residuals (the choice also adopted in this paper), leading to 
regularized least squares (ReLS). The term J is instead the 
regularizer which typically includes smoothness informa¬ 
tion on g. Finally, the positive scalar 7 is the regularization 
parameter which has to suitably balance V and J. In practice, 
it is always unknown and has to be determined from data. 


An important advantage of ReLS over classical PEM is that 
the difficult model order determination can be replaced by 
estimation of few regularization variables. For instance, the 
stable spline estimator introduced in [38] depends only on 
two parameters: the regularization parameter 7 and another 
one which enters J by determining how fast g is expected 
to decay to zero. Such variables can be determined e.g. by 
cross validation [18], Cp statistics [26, subsection 7.4] or 
marginal likelihood optimization [31,32], an approach re¬ 
cently proved to be much effective [11,34,38]. In particular, 
in the spirit of Stein’s effect [28], a carefully tuned regular¬ 
ization can much reduce the variance of the estimates just 
introducing a small bias in the identification process. Hence, 
the mean squared error of the estimator can turn out infe¬ 
rior than that achieved by PEM [11,38]. However, to obtain 
this, the choice of J is crucial since it has a major effect on 
the quality of impulse response reconstruction. It is thus of 
interest to investigate and compare the performance of dif¬ 
ferent regularizers recently proposed in the system identifi¬ 
cation literature. 

A recent regularization approach relies on the so called nu¬ 
clear norms. The nuclear norm (or trace norm) of a matrix is 
the sum of its singular values. Being the convex envelope of 
the rank function on the spectral ball, it has been often used 
as a convex surrogate of the rank function [20]. In particu¬ 
lar, since the McMillan degree (minimum realization order) 
of a discrete time time-invariant system coincides with the 
rank of the Hankel operator constructed from the impulse 
response coefficients, it is tempting to set J to the nuclear 
norm of the Hankel operator associated with the impulse 
response g. In this way, output data are described while en¬ 
couraging a low McMillan degree, see [25,29,33,48] and 
also [27,47] for extensions of this idea for estimation e.g. 
of Box-Jenkins models. 

Atomic norms have been also adopted as regularizers for 
system identification in the last years [9]. The function to be 
reconstructed is described as the sum of a (possibly infinite) 
number of basis functions, dubbed atoms. The penalty J is 
then given by the atomic norm defined, under some tech¬ 
nical conditions, by the convex hull of the atomic set. For 
instance, the convex hull of one-sparse vectors of unit Eu¬ 
clidean norm leads to the norm which enjoys important 
recovery properties [17], being also related to the popular 


LASSO procedure [50]. 

A motivation underlying the use of the atomic norm is that, 
in some sense, it represents the best convex penalty when 
the function is sum of few atoms. Successful applications in 
signal processing and machine vision regard estimation of 
sparse vectors and low-rank matrices [2,8,16]. This tech¬ 
nique has been recently introduced also in the system identi¬ 
fication scenario in [46] exploiting low-order rational trans¬ 
fer functions as atomic set, see also [42,43] for other ap¬ 
proaches that use the penalty. 

Other recent system identification techniques exploit penalty 
terms induced by kernel functions [5,45], which have to cap¬ 
ture the expected features of the unknown function. In partic¬ 
ular, the works [11,37,38] have proposed a class of kernels 
for system identification, including stable spline (SS)/tuned- 
correlated (TC), which encodes information on smoothness 
and exponential stability of g, see also [13] for other insights 
on the stable spline kernel structure. 

The goal of this paper is to compare the performance of 
ReLS equipped with atomic, nuclear or kernel-based norms 
via numerical studies. For this purpose, we will perform a 
Monte Carlo experiment where at every run the true system 
is a different rational transfer function randomly chosen by 
a MATLAB generator. The system input and the signal to 
noise ratio also varies from run to run, thus leading to a large 
variety of system identification problems. The results reveal 
that, in many cases, atomic and nuclear norms lead to un¬ 
satisfactory impulse response estimates. The drawbacks of 
these approaches are explained through the Bayesian inter¬ 
pretation of regularization, where different J are seen as dif¬ 
ferent a priori probability density functions (pdf) assigned 
to g. This interpretation explains why the current implemen¬ 
tations of atomic and nuclear norms fail in capturing impor¬ 
tant characteristics of stable dynamic systems. As far as the 
nuclear norm is concerned, when used in conjunction with a 
FIR model of order m, we shall see that it essentially corre¬ 
sponds to modeling the impulse response coefficients g, as a 
non stationary white noise process with a variance roughly 
decaying as 1 /f for f < m /2 and increasing as l/(m —f) for 
t > m/2. This means that the prior underlying this approach 
does not embed any information on impulse response stabil¬ 
ity and smoothness, two key features to achieve good mean 
squared error properties [39]. As for ReLS equipped with the 
atomic norm described in [46], we will see that it suffers of 
the limitations of the LASSO procedure recently discussed 
in [4]. 

The results from the same Monte Carlo study will also show 
that estimators based on stable kernels outperform ReLS re¬ 
lying on atomic and Hankel nuclear norms. Furthermore, 
we will design a new class of regularizers induced by “in¬ 
tegral” versions of stable spline kernels. This will lead to 
novel ReLS approaches for system identification that may 
outperform also the classical PEM equipped with an oracle 
for model complexity selection (the oracle selects the best 
model order exploiting information on the true system). 
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The paper is so organized. Section 2 reports the problem 
statement while in Section 3 we briefly review the ReLS es¬ 
timators based on the Hankel nuclear norm and the atomic 
norm. Section 4 gives insights about limitations of the tech¬ 
niques described in Section 3 using the Bayesian interpreta¬ 
tion of regularization. In Section 5 we briefly review stable 
spline kernels introduced in [36, 38], also introducing the 
concepts of stable spline atomic set. In Section 6, new inte¬ 
gral versions of stable spline kernels are introduced and used 
to define novel ReLS estimators for linear system identifi¬ 
cation. All of the estimators are then tested via Monte Carlo 
studies in Section 7. Conclusions end the paper while some 
mathematical details are gathered in Appendix. 

2 Problem statement 

We use u{t) and y(t) to denote, respectively, the input and 
the noisy output of a SISO system observed at time instant 
t. For convenience of notation we shall also use the notation 
y, :=y{ti). The measurement model we consider is of the 
Output-Error (OE) type: 

y,'= i=l,...,N (2.1) 

where g is the system impulse response, (g 0 m),- denotes 
the convolution in discrete time between g and u evaluated 
at tj. Einally, e, are independent Gaussian noises of constant 
variance a . The results could be straightforwardly extended 
to non stationary noises and/or ARMAX/BJ type structures. 
Our problem is to estimate the impulse response g assuming 
the input u is (deterministic and) known and having collected 
the measurements T = [yj ... y^r]^. 

Remark 2.1 In the sequel, we will discuss regularized esti¬ 
mators based on HR (infinite impulse response) or FIR (fi¬ 
nite impulse response) models for g and equipped with a 
regularizer J{g). In the HR case, following the well known 
concept of BIBO stability, J is said to include the stabil¬ 
ity constraint if it embeds the constraint FIR 

models are already structurally stable. However, also in this 
case we will often use expressions as lack of stability con¬ 
straint to mean that J does not include the information that 
impulse response is expected to decay to zero as a function 
of the time lag. In particular, this concept will be formalized 
in Bayesian terms. 


of its singular values, i.e. 


I|a||*=Ect,(a). 

i 


The motivation underlying the use of this type of penalty for 
system identification stems from the fact that the minimum 
realization order (also known as McMillan degree) of a dis¬ 
crete time LTI system coincides with the rank of the Hankel 
operator H{g) constructed from the impulse response coef¬ 
ficients gj^, i.e. 


H{8) 


^ 8i 82 83 
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(3.1) 


Then, as described e.g. in [25,33], an estimator trading data 
fit with low McMillan degree can be obtained by solving 


argmin 

g 


N 


E (yi ~ 


{8<^u)i)^ + Y\\H{g)\\„ 


(3.2) 


3.2 Regularization via atomic norm 

A different, but related regularization approach to linear 
system identification, called the atomic norm regularization 
approach, was recently suggested in [46]. It describes the 
model using “atoms” and defines a model complexity mea¬ 
sure in terms of the “atomic norm”. This complexity mea¬ 
sure is then used for regularization. 

Let If’ be the complex plane and ^ = {w G'f’, |w| < 1} and 
consider below the discrete-time case. A simplistic account 
of the idea is as follows: Given a set of “atoms”, G„(z),w G 
(which can be interpreted as basis functions for the model 
transfer function), we can construct a linear model via linear 
combinations of the atoms. Eor a concrete feeling of the 
concept, think of the atoms as normalized first order system 
with pole (possibly complex) denoted hy w Gif), so 

G„(z)=^-^^. (3.3) 

z — w 


3 ReLS based on Hankel nuclear/atomic norms 

In this section we describe two ReLS approaches which use 
as regularizer J either the Hankel nuclear norm or a version 
of the atomic norm that can be approximated by the norm. 

3.1 Regularization via Hankel nuclear norm 

We introduce a regularizer based on the Hankel nuclear 
norm. Recall that the nuclear norm of a matrix A is the sum 


The transfer function of any linear system can be written as 
a finite or countable linear combination of these atoms: 

G{z)=Y,akG„^(z), WkG^. (3.4) 

k 

The atomic norm of this system is defined as 

l|G(z)IU = inf| E kl : (3.4) holds I. (3.5) 

J 
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It has been proved in [46] that the atomic norm well approx¬ 
imates the Hankel nuclear norm of G{z), thus motivating its 
use in system identification. 

For computational reasons, a finite number of atoms 

A: = 1, • • • , p are selected. If G„^ (z) is selected and 

then G^^» (z) shall be also within the selected p atoms where 

wl- is the complex conjugate of w^. Let the impulse response 
of G„^ (z) be and denote 


By the Bayes rule, the posterior of g is the product of the 
likelihood and the prior, i.e. p{g\Y) p{Y\g)p{g), so that 


-logp(g|y) = 


j{g) 

2(7^ 2A 


-f Const. 


Hence, setting 7 = one can conclude that ReLS (4.1) 

can be always seen as a maximum a posteriori (MAP) esti¬ 
mator. 


Then, recalling (3.4), the fit between the measured output 
and the model is 

!=1 \ k=l ) 

and the atomic norm regularized estimate becomes the 
LASSO like 

N / p ^ 

fl = argminE 7;-E«-tVW +^El“-tl- 

“ i=l \ k=l ) k=l 


The estimator (4.1) obtained from this Bayesian perspective 
will perform well only provided the density p{g), defined 
through (4.2), succeeds in capturing the essential features 
and possibly prior knowledge on the underlying dynamical 
system. In particular the prior should privilege stable and 
possibly smooth impulse responses. Now, it is of interest to 
elucidate the shape of the priors underlying the two ReLS 
approaches introduced in the previous section. 

4.1 The Hankel nuclear norm case 

According to the previous Bayesian interpretation of regu¬ 
larization, the nuclear norm regularizer 

J{g)=Y,a,{H) (4.3) 


is associated with the prior 


4 Bayesian interpretation of regularization: the Hankel 
nuclear norm and the atomic norm case 


p^(g)°cexp 




Every ReLS estimator 

N 

argmin E (A' -{g® u)if + yj (g) (4.1) 

^ (=1 

can be given a simple interpretation in Bayesian terms. 
To simplify the exposition, assume that the measurements 
model (2.1) can be written as a FIR of order m. Then, abus¬ 
ing notation, we use g to denote the (column) vector con¬ 
taining the impulse response coefficients gj,...,g^. More 
specifically, assume that g is a random vector with pdf 

p(g)°cexp^-^^ (4.2) 

where “ 0 =” stands for “proportional to” while A is a positive 
scale factor. Assume also that Y is corrupted by additive zero 
mean white Gaussian noise, independent of g, of variance 


p(F|g)ocexp 


Icp- ) ' 


Now, we would like to understand how well p^ describes 
typical features of an impulse response. By symmetry, it is 
easy to assess that g is zero-mean but then a simple closed 
form expression for the distribution of its components g^- is 
not available. In Appendix 9.1 we describe a Markov chain 
Monte Carlo (MCMC) scheme to efficiently sample from 
Pfj [ 22 ], also exploiting a close form approximation of of 
Pfj. Fig. 1 reports some results extracted from a chain of 
length le 6 setting 2X = l,g € The left 

panel shows the standard deviations of the impulse response 
coefficients (solid line) while the right panel displays in¬ 
formation on the correlation between the components of g. 
The outcomes show that, under the impulse response co¬ 
efficients g^- are (approximately) uncorrelated and with bath¬ 
tub variance. This suggests that the prior associated with the 
Hankel nuclear norm provides a weakly informative guess 
when searching for stable and smooth impulse responses. 
This analysis also shows that even an apparently reasonable 
regularizer could be associated to a meaningless prior dis¬ 
tribution. 

Remark 4.1 Lack of stability, manifesting itself in the fact 
that the impulse response prior variance does not decay to 
zero as a function of the lag index t, should not come as 
a surprise. In fact, for any fixed m and any given “stable” 
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partial impulse response it is always possible to find 

an unstable impulse response with the same Hankel 

nuclear norm. A simple example goes as follows: consider 
the (finite) Hankel matrix (g) built with the first m impulse 
response coefficients of g, = ca b and the matrix H^{h) 
built with hf = a"^ (a)' b. It is simple to check that 

Hm{g) can be obtained from H^{h) by reversing the order 
of rows and columns and thus 

Remark 4.2 In [11 ], working under a deterministic frame¬ 
work, it has been shown that favorable mean squared error 
(MSB) properties in the reconstruction of stable exponen¬ 
tials can be obtained letting the diagonal (and off-diagonal) 
elements of the regularization matrix decay exponentially to 
zero (as the entry number increases). In the Hankel case, 
it is shown in Appendix 9.1 that one approximately has 
Var{g() oc t * for 1 <t < and Var(gf) (m — t -\-\) * 
for < t < m. So, the Hankel regularizer does not in¬ 
clude exponential dynamics, a key point to well trade-off 
bias and variance. Note also that the particular profile of 
Var{gf) induced by Hankel also implies that the choice of m 
has an important effect on the structure of the Hankel norm 
regularizer. 

4.2 The atomic norm case 


sparsity, as documented in the Statistics literature [19,54] 
and also recently demonstrated and discussed in [4]. In par¬ 
ticular, assume that the true impulse response g is the sum 
of a finite number of atoms. For a large enough value of 
7 , all the aj which do not contribute to g will go to zero 
but the linear penalty in (3.6) will still yield to biased es¬ 
timates of the other expansion coefficients, so that the re¬ 
sulting estimate is oversmoothed (biased). An example of 
this phenomenon is illustrated in Fig. 2 (left panel) which 
reports the results obtained via a Monte Carlo study where 
the impulse response g is fixed while different noise and in¬ 
put realizations are generated at every run, in the same way 
as described in Section 7. The estimates of g are obtained 
by (3.6) using 300 input-output samples, with 7 chosen by 
an oracle. The latter has access to the true g and selects the 
regularization parameter which minimizes the mean squared 
error (the concept of oracle-based estimation is further de¬ 
tailed in subsection 7.2). Under a Bayesian viewpoint, the 
estimator suffers from the equal probability assigned to all 
the atomic functions so that the oracle can select few atoms 
only assuming that the prior variance (which is proportional 
to X) is quite low. Obviously, this introduces a bias. It is 
worth asking if the adoption of different atoms and an un¬ 
equal weighting on their expansion coefficients may be a 
remedy. We explore this issue in the next section. 


It has been shown in [46] that, restricting to {gt}f^^+ G ^ 2 ’ 
the atomic norm (3.6) is equivalent to the Hankel nuclear 
norm. For this equivalence to extend to the “finite” nuclear 
norm (4.3) the size of the Hankel matrix (3.1) (and thus 
the truncation index m) needs to grow to infinity. Note that 
the restriction {gt}f,^^+ G ^2 t® critical since, as we have 
seen, for any finite m the Hankel nuclear norm does not 
include a “stability” constraint. The situation is different with 
the atomic norm as any finite sum of (stable) atoms (3.4) 
will always result in a stable impulse response. Yet the 
penalty on the coefficients in (3.6) (see also (4.4) below) may 
introduce severe bias. More insights are discussed below. 
Let gf = Yfj=\CtiPj{t)- In view of (3.6), the atomic norm 
regularizer 

J{e) = L \aj\ (4.4) 

7=1 

amounts to a Bayesian prior 


FAiv(^)°=exp 


so that the unknown parameters a^ are Laplace distributed 
independent random variables, all having the same variance. 
As in the case of LASSO, the solutions of (3.6) enjoy some 
sparsity properties meaning that, when 7 increases, more 
and more elements of vector a are forced to zero. This may 
seem an appealing property but the results may fail to meet 
the expectations. In fact, the penalty may introduce an 
excessive penalty on some large coefficients aj to obtain 



5 Stable spline kernels and atoms 

In the first part of this section we review the stable spline 
kernel, comparing its features with the prior induced by 
the Hankel nuclear norm discussed in section 4.1. We then 
compare the structure of the stable spline estimator with the 
atomic approach of section 4.2, also introducing the concept 
of stable spline atoms. 


5.1 Stable spline kernels 


According to its stochastic interpretation, we have seen that 
every ReLS can be seen as a Bayesian estimator. In particu¬ 
lar, each quadratic penalty can be obtained modeling the im¬ 
pulse response as a particular (zero-mean) Gaussian process. 
This implies that fixing the covariance (kernel) is equivalent 
to fixing the quadratic regularizer. 

In system identification the kernel should include informa¬ 
tion on smoothness and exponential stability of the impulse 
response g. One choice suggested in the literature is the so 
called first-order Stable Spline kernel [11,36]. Letting E de¬ 
note expectation, for t = 1 , 2 ,... and i = 1 , 2 ,... it is defined 
by 

/:(i,f) = E[g,g,]0<a<l, (5.1) 

while it is null elsewhere. The second-order version was 
proposed in [38]. It is given by 


2 


^3max(.s,/) 

---, 0 < a < 1 


(5.2) 
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Fig. 1. Prior induced by the Hankel Nuclear Norm: the impulse response coefficients are contained in the vector g e modeled as 
a random vector with probability density function P[]{g) “ sxp( —[|//(g)||^). Here, || • ||^ is the nuclear norm while H(g) is the Hankel 
matrix (3.1) of size 50 x 50. Left, standard deviations of the impulse response coefficients reconstructed by MCMC (solid line) and 
approximated using the prior (9.1,9.2) (dashed line) derived in Appendix. The figure also displays the standard deviations of gi^ when 
g is a Gaussian random vector with stable spline covariance (5.1) for two different values of a (dashdot lines). Right: 50-th row of the 
matrix containing the correlation coefficients returned by the MATLAB command corrcoef (M) where each row of the Ie6x99 matrix 
M contains one MCMC realization of g under the Hankel prior Pnig) s>^p(^II^f(g)||*)- 

Atomic+Oracle ITS 




Fig. 2. Monte Carlo experiment with a fixed impulse response: true impulse response (thick line) and meanistandard deviation (dashed 
line, computed after 100 runs) of the estimators At-l-Or (left) and iTS (implemented as described in subsection 7.2). Recall that, differently 
from iTS, At-l-Or is not implementable in practice since y is tuned using the true impulse response. 


and leads to smoother impulse response realizations. Notice 
that both kernels are parametrized by a which is interpreted 
as an unknown hyperparameter. 


The left panel of Fig. 1 reports the standard deviations of a 
random vector g whose covariance is the sampled version of 
(5.1) with a set to 0.8 and 0.9 respectively (dashdot lines). 
Differently from the bathtub prior induced by the Hankel 
nuclear norm (solid line), the stable spline kernel describes 
system dynamics which go exponentially to zero. Further, 
the hyperparameter a enhances model flexibility since it 
permits to tune the decay rate. Also, while nonstationary 
white noise underlies the Hankel prior (see the right panel 
of Fig. 1), (5.1) introduces correlation among the impulse 


response coefficients, hence including information on im¬ 
pulse response smoothness. This is important to have good 
MSE properties as discussed in Remark 4.2. With the same 
remark in mind, note also that, differently from the Hankel 
nuclear norm case, the stable spline prior shape is indepen¬ 
dent of the selected FIR order, thus making its effect on the 
estimation process more transparent. If m is increased e.g. to 
2m, the statistics of the first m impulse response coefficients 
remain the same. 

5.2 Stable spline atoms and regularizer 

Let g be a discrete-time Gaussian stochastic process with co- 
variance proportional to the stable spline kernel (5.1). Then, 
the following proposition characterizes the architecture of 
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the minimum variance estimator of g given the measure¬ 
ments ( 2 . 1 ). 

Proposition 5.1 Let 


• when the stable spline prior (5.1) is used, according to 
(5.6), the impulse response estimate is searched in the 
subspace spanned by the functions Pj given by (5.5). 
It is then natural to call 


y,-= /=1,...,A^ 



where e,- are all mutually independent, Gaussian, with zero- 
mean and variance (7 . Assume also that g is a zero-mean 
Gaussian and causal process, independent of the noise, with 
covariance 

(5.3) 

with A a positive scale factor. Then, one has 

=f:C,P;(^)P/(0 (5.4) 

1=1 

where 


the (hrst-order) stable spline atomic set. Such a set is 
parametrized by the positive scalar a which measures 
the distance from instability. Note also that the stable 
spline atoms pj are ordered in such a way that their 
energy content at high frequencies increases as j aug¬ 
ments and that their structure is much different w.r.t. 
that adopted in [9] and reported in (3.3); 

• atomic norms are typically designed in such a way as to 
assign the same penalty to each expansion coefficient 
aj. Instead, according to (5.7,5.5), given g — E,J=i ^jPj’ 
the (hrst-order) stable spline estimator uses the stable 
spline regularizer 


V2sin(, C; = —- 


and 


where 


Eferll"] = Y^djPjf) 

1=1 


(5.6) 


d = argmin ^ ( y,- - £ hijaj 


1=1 \ 1=1 


\ (5.7) 

1=1 W 


with 


hii = {Pj®u)i, 7 =^. 


Finally, the estimate admits also the closed-form expression 


= Y^Ci{K{-,t)®u{-))i 

!=1 

where C; are the components of the vector c 
c:=(A + 74)-'y 


(5.8) 


(5.9) 


1=1 W 

where are weights decaying to zero. The expansion 
coefficients aj are thus constrained to decay to zero at a 
rate which guarantees both impulse response continuity 
and system stability. Hence, penalizing high-frequency 
components, also by adjusting 7 , the stable spline regu¬ 
larizer may privilege more parsimonious model|^ pos¬ 
sibly leading to estimators which better trade-off bias 
and variance; 

• the hrst step in the design of atomic-norm based es¬ 
timators is the selection of the atomic set whose con¬ 
vex hull then dehnes the regularizer. In our stochastic 
framework, both of these steps are condensed in the 
choice of the covariance (kernel). Indeed, the kernel 
encodes both the atoms Pj and the regularizer, dehned 
by ^j. This subsumes modeling and computational ad¬ 
vantages: one needs neither to choose the number of 
atoms to be used (the kernel includes an inhnite num¬ 
ber of basis functions Pj so that no truncation affects 
the impulse response representation) nor store any ba¬ 
sis function in memory. In fact, the estimate (5.8) can 
be computed using the kernel just inverting a matrix, as 
shown in (5.9). This feature is related to what is called 
the kernel trick in the machine learning literature [45]. 


with 


6 Integral versions of stable spline kernels 


= E “(7 - 0 ( E «('■ - k)K{t,k) ) (5.10) 

r=l V*:=l / 


The result above leads to the following comments: 


When the stable spline kernel is used, we have seen from 
(5.5) that the atoms Pj depend on a single parameter a 


Note that, when using the approach proposed in [46], the atoms 
reported in (3.6) contain the term 1 — |w|^ which implicitly pe¬ 
nalises basis functions close to the unit circle. However, no penalty 
is assigned to high-frequency impulse response components. 
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which establishes the decay rate of the eigenfunctions. Fol¬ 
lowing also [ 10 ], a way to further enrich the hypothesis 
space, making it more flexible, is to exploit different values 
of a. However, differently from [10], our aim is the syn¬ 
thesis of a new kernel (in closed form) able to “contain” an 
infinite number of different decay rates, but which remains 
function only of a finite number of hyperparameters. In our 
Bayesian framework, we can obtain this by modeling the 
impulse response g as the sum of i.i.d. stochastic processes 
whose stable spline kernels differ in the value of a. In par¬ 
ticular, let us introduce two hyperparameters: a„, related to 
the fastest system pole, and a^, connected with the slow¬ 
est dynamics. Then, we can consider p values a, satisfying 
ctm < cti < a 2 < ■ ■ ■ < oCp < and equally spaced with 
step A„, then building the kernel 

P , ^ 

A max(s,? -i\ 

A„£a,. ( 6 . 1 ) 

/=1 

which induces the richer atomic set 






where we still have = 


7 = 1 , 2 ,... and /= l, 2 ,...,pj 

(6.2) 

F ^ so 


that A„ —> 0 in (6.1), the sum becomes the integral 
j“m which leads to the ntw first-order integral 

stable spline kernel called iTC and reported in Table 1. The 
same procedure can be repeated starting from (5.2) obtain¬ 
ing the second-order integral stable spline kernel, called 
iSS in Table 1. 

Finally, these two kernels iTC and iSS can be summed up, 
obtaining the kernel dubbed iTS in Table 1. This kernel thus 
synthetizes an infinite number of atoms that not only have 
different decay rates a G [a„,a^] but also two different 
smoothness levels. 


A > 0 , 0 < a < 1 , rj = [X,a] 

/ fyk+J+maxik.j) 3max(k,j)\ 

ss = ^- 

A > 0, 0 < a < 1, 77 = [A,a] 


iTC 


PkjiPl) 
A > 0 , 


max(t:j) + l _ max(ij')+l 

_ ^ “M_ -HI _ 

max(A, 7 ) -f 1 

o< «„,<««<!; pI = [^,cc^,cCm] 


iSS 


1 k+j+miix{k,j) + l , 3max(*:j')+l 

p __ 

2(k-|-7-|-max(fe,7)-|-1) 18max(A,7)+6 


A(^*+f+max(t:j') + l 

2{k + j + max(fc, 7 ) + 1) 
A > 0 , 0 < < 1 ; 


Act3max(^.7) + 1 

18max(A;,7) -t-6 
77 = [X,a„,aM\ 


iTS 


max(7:j) + l _ max(i.7)+l 

^ _ 'fin _ 

max(A, 7 ) + 1 

1 k+j+ma.x(k,j) + \ , 3max(«:j') + l 

__ 

2(fc-f 7 + max(fc, 7 ) + 1) 18max(A;,7) -|-6 

0 „*+y+max(7:.7) + I j 3max(*:,7)+l 

_^ _ 

2{k-\- 7 -f max(fc, 7 ) -|- 1 ) 18 max(k, 7 ) -f 6 

A > 0 , 0 < «„, < < 1 ; 77 = [A, a^] 



Table 1 

List of regularization matrices P built using Stable Kernels 

T 2 

where -f (7 /„. Finally, the impulse response 

estimate is the solution of 

argminllT-Ogll^-fgV^'g 

g 


We can now introduce estimators based on the different reg¬ 
ularization matrices P which are function of the hyperpa¬ 
rameter vector 77 as reported in Table 1. The estimators have 
the same structure as documented in [38] and [11]. In par¬ 
ticular, assuming a high order FIR, it is useful to rewrite the 
measurements model ( 2 . 1 ) as 

T = 4>g-f£ (6.3) 


(note that J{g) has become g P g above), which is given 
by 


| = (6.5) 


with T] set to its estimate f). 


7 Simulated data: Monte Carlo studies 


Here, as in section 4, g now denotes the m-dimensional 
vector whose components are the impulse response coeffi¬ 
cients while the regression matrix <t> is defined by the in- 
put samples. Then, the noise variance a is estimated from 
the residuals obtained by fitting g via least squares. The hy¬ 
perparameter vector is instead determined through marginal 
likelihood optimization [34], i.e. 

f] = argmax -f logdet (E^) (6.4) 


7.7 Data sets and performance index 

We compare different estimators for discrete-time system 
identification. To this aim, we resort to two Monte Carlo 
studies whose implementation details are the same as de¬ 
scribed in Section 7.2 of [39]. Here, we just recall that each 
Monte Carlo consists of 1000 runs. At each run, g is defined 
by a different rational transfer function, given by the ratio 
of two polynomials of the same order (varying from 1 to 
30), randomly obtained by a Matlab generator. The system. 



100 


Fits (N=300) 


Fits (N=1000) 



Fig. 3. Monte Carlo experiment: boxplots of the fits using 300 (left panel) or 1000 (right panel) input-output samples. Recall that At-i-Or, 
Hank-l-Or and Oe-l-Or are not implementable in practice since they exploit the knowledge of the true impulse response to tune model 
complexity. 




Atomic 

Hankel 

At+Or 

Hank+Or 

Oe+Or 

ITS 

N = 

300 

4.5 

31.9 

59.5 

63.7 

64.8 

65.1 

N = 

1000 

39.2 

63.3 

61A 

73.7 

72.6 

72.9 


Table 2 

Monte Carlo experiment: average fit achieved by PEM equipped with oracle (Oe-l-Or), ReLS based on Atomic norms (Atomic and 
At-l-Or), Hankel nuclear norms (Hankel and Hank-l-Or) and the new estimator based on the stable kernel ITS, using 300 or 1000 input- 
output samples. Recall that At-l-Or, Hank-l-Or and Oe-l-Or are not implementable in practice since they exploit the knowledge of the true 
impulse response to tune model complexity. 
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Fig. 4. Monte Carlo experiment: boxplots of the fits returned by ReLS equipped with different stable kernels using 300 (left panel) or 
1000 (right panel) input-output samples. All the estimators are implementable in practice. 




iTS 

iTC 

iSS 

TC 

SS 

N = 

300 

65.1 

64.8 

64.4 

63.2 

62.3 

N = 

1000 

72.9 

72.7 

71.6 

70.6 

69.2 


Table 3 

Monte Carlo experiment: average fit returned by ReLS equipped with different stable kernels using 300 or 1000 input-output samples. 
All the estimators are implementable in practice. 
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Fig. 5. Monte Carlo experiment using a different system generator: boxplots of the fits achieved by Oe+Or and iTS using 300 (left 
panel) or 1000 (right panel) input-output samples. 


initially at rest, is fed with an input obtained by filtering a 
zero mean unit variance white Gaussian noise through a 2nd 
order rational transfer function which also varies from run 
to run. Output data are corrupted by a white Gaussian noise, 
with the SNR (ratio between the variance of the noiseless 
output and the noise) randomly drawn at every run in the 
interval [1,10]. The first and the second Monte Carlo study 
differ in the number N of available input-output measure¬ 
ment, equal to 300 or 1000, respectively. 

Given an estimator g, its performance index is evaluated 
computing the fit measures 

^/ = 100x 7 = 1,...,1000 (7.1) 

where gj and gj are the true and the estimated impulse 
response at the j-th run. 

7.2 Estimators compared via the Monte Carlo study 

Oe+Or All the impulse response estimators introduced so 
far depend on an unknown parameter vector, denoted by t], 
which controls model complexity. For instance, in all the 
regularized techniques 7] contains at least the regularization 
parameter 7 . When using PEM, 7] instead represents the 
order of different model structures. For instance, consider 
the use of rational transfer functions for discrete-time system 
identification. Then, 7] is the degree of the polynomials B(z) 
and A{z) composing the transfer function given, in the z- 
transfer domain, by 


G{z) 


A(z) 


(7.2) 


Hereby, g is said to be an oracle-based estimator if, having 
access to the true impulse response g, it determines model 
complexity by maximizing the fit measure (7.1) w.r.t. Tj. 
Note that such an impulse response estimator is never im- 
plementable in practice since the true impulse response g is 


not available.This identification procedure is however use¬ 
ful since it provides a performance reference. In particular, 
Oe -f Or denotes the following PEM procedure. First, the 
Matlab function Oe. m is used to fit the model structures 
(7.2) to data for 77 = 1,... ,30 (the information that system 
was initially at rest is given to the estimator). Then, among 
the 30 impulse response estimates obtained, Oe -f Or returns 
that maximizing (7.1). 


{Hankel,Hank+Or} We have implemented two vari¬ 
ants of the estimator (8.2) based on the Hankel nuclear 
norm adopting a FIR model of order m — 99 with the size 
of the Hankel matrix H{g) equal to 50 x 50. In the first 
version, dubbed Hankel, the regularization parameter is 
estimated via cross validation. Data are divided into an 
identification and validation data set of equal size. For every 
value of 7 in the grid defined by the MATFAB command 
logspace (-5,4,50), the solution (8.2) is obtained from 
the identification data using the software CVX [23,24]. The 
regularization parameter 7 is the one leading to the best 
prediction on the validation set according to a quadratic 
fit criterion. Finally, the impulse response estimate is 
achieved by solving ( 8 . 2 ) using 7=7 and all the available 
data. The second version of the estimator exploits an or¬ 
acle which, at every run, selects the value of 7 in the set 
logspace (-5,4,50) which maximizes the fit (7.1). 


{Atomic,At+Or} Two types of atomic estimators (3.6) 
have been implemented. In both the variants, the atomic set 

is defined by the poles = ae'^^ and their complex con¬ 
jugate wl where, using a MATFAB notation, a and j3 take 
values on the two grids [0.02 : 0.02 : 0.98 0.99 0.999] 
and [0 ;r/50 : ;r/50 : :;r] , respectively. Optimization (3.6) is 
then performed constraining each couple of expansions co¬ 
efficients aj^ related to complex conjugate poles to be equal 
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and real|^ The first estimator, dubbed Atomic, obtains at 
each Monte Carlo run the value of the regularization pa¬ 
rameter via 10-fold cross validation. It has been imple¬ 
mented exploiting the MATLAB software package glmnet 
[ 21 ], providing the information on system initial conditions 
and input delay. We have also used the MATLAB com¬ 
mands options.larabda=logspace(-5, 4, 100 ) to de- 
hne the grid where the regularization parameter is searched 
and options . intr=0 to specify that the relation between 
g and the system output is linear (and not affine). The sec¬ 
ond estimator, dubbed At-tOr, is implemented in the same 
way, except that the regularization parameter is selected by 
the oracle. 


{TC,SS,iTC,iSS,iTS} These are the estimators (6.5) 
equipped with the stable kernels described in Section 6 , with 
the dimension of g set to m = 100 and the hyperparameter 
vector 77 estimated via marginal likelihood optimization 
(6.4). Note that all of these estimators are implementable in 
practice 

7.3 Results 

Boxplots of the hts achieved by Atomic, At-tOr, Hankel, 
Hank-tOr, Oe-tOr and iTS in the two Monte Carlo studies 
are displayed in the left and bottom panel of Fig. 3. Table 2 
also shows the average hts. 

It is apparent that the hts of Hankel are signihcantly smaller 
than those returned by Oe-tOr. The more so in the hrst ex¬ 
periment where the average ht of Hankel is 31.9 while that 
of Oe-tOr is 64.8. Instead, the performance of Hank-tOr (7 is 
chosen by the oracle) is virtually identical to that of Oe-tOr. 
As for Atomic, its performance is not satisfactory, inferior 
than that of Hankel. Only using the oracle-based estimator 
At-tOr, the performance becomes comparable with that of 
Oe-tOr. Instead, the integral prior iTS largely outperforms 
the other regularized system identihcation approaches pro¬ 
posed in the literature. Remarkably, its performance is very 
close or also superior to that of the oracle-based approaches. 
For instance, in the two experiments iTS provides average 
hts equal to 65.1 and 72.9 whereas Oe-tOr return 64.8 and 
72.6. 

The benehcial effect of the unequal weighting on the ex¬ 
pansion coefficients is evident also reconsidering Fig. 2: iTS 
(right panel, results with 7 estimated via marginal likelihood) 
outperforms At-tOr (left panel, results with 7 estimated by 
the oracle). 

Finally, Fig. 4 and Table 3 permit to compare the perfor¬ 
mance of all the ReLS approaches equipped with the stable 
kernels. As a matter of fact, all the estimators perform well, 
revealing the importance of informing the estimation pro¬ 
cess of system stability. Notice also that iTS provides the 


2 

We have also implemented the estimator (3.6) equipped with a 
different atomic set given by discrete-time Laguerre basis functions 
[52]. Results (not shown) are similar to those described in the 
sequel obtained adopting (3.3). 


best results, pointing out benefits of integral versions of sta¬ 
ble kernels. 

7.4 Use of a different system generator 

The random generator adopted so far defines challenging 
systems, with poles often located at high frequencies. How¬ 
ever, a visual inspection reveals that the average number 
of signihcant Hankel singular values is rather small, being 
around 10. A consequence is that the mean order selected 
by OeH-Or is around 5. We have thus found of interest to 
repeat the experiment adopting a different “higher-order” 
generator. The rational transfer function order is now fixed 
to 30 and the poles are selected iterating the following pro¬ 
cedure at every Monte Carlo run: With equal probability a 
real or a couple of complex conjugate poles is added to the 
denominator until its order reaches 30. In the case of a real 
pole, it is randomly drawn from a uniform distribution on 
[—0.95,0.95], while the absolute values and phases of the 
complex conjugate pairs are independent random variables 
uniform on [0,95] and [0, 7z], respectively. The zeros are then 
selected in the same way except that their absolute values 
are drawn in the interval [— 2 , 2 ]. 

Boxplots of the ht values are reported in Fig. 5, restricting 
the comparison to OeH-Or and iTS. Remarkably, the advan¬ 
tage of iTS over OeH-Or increases: its average ht is 65.3 vs 
59.3 when N = 300 and 76.7 vs 73.3 when N = 1000. This 
can be explained considering that, on average, OeH-Or now 
selects models of order 13 and this may further undermine 
optimal asymptotic properties of PEM under correct order 
specihcation. In addition, such an estimator has now to hinge 
on higher-dimensional non convex problems, possibly more 
prone to local minima. Under these circumstances, ReLS 
equipped with empirical Bayes can be especially useful, see 
also [34, 35] for insights on marginal likelihood effective¬ 
ness in controlling model complexity. 


8 Real data: temperature prediction 

To test the algorithms on real data we have also considered 
thermodynamic modeling of buildings. We placed sensors 
in two rooms of a small two-hoor residential building of 
about 80 m and 200 m ; the sensors have been placed only 
on one hoor (approximately 40 m ) and their location is 
approximately shown in Fig. 6 (top panel). The larger room 
is the living room while the smaller is the kitchen. The 
experimental data was collected through a wireless sensor 
network made of 8 Tmote-Sky nodes produced by Moteiv 
Inc, each of them is provided with a temperature sensor. 
The building was inhabited during the measurement period, 
which lasted for 8 days starting from February 24th, 2011; 
samples were taken every 5 minutes. The heating systems 
was controlled by a thermostat; the reference temperature 
was manually set every day depending upon occupancy and 
other needs. 

The location of the 8 sensors was as follows: 
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Fig. 6. Nodes location (top) and measured temperatures during the 
first 40 hours (bottom). 

• Node #1 (label 137 in Fig. 6) was above a cabinet (2.5 
meters high). 

• Node #2 (label 111) was above a sideboard, about 1.8 
meters high, close to thermoconvector. 

• Node #3 (label 139) was above a cabinet (2.5 meters 
high). 

• Node #4 (label 140) was placed on a bookshelf (1.5 
meters high). 

• Node #5 (label 141) was placed outside. 

• Node #6 (label 153) was placed above the stove (2 
meters high). 

• Node #7 (label 156) was placed in the middle of the 
room, hanging from the ceiling (about 2 meters high). 

• Node #8 (label 160) was placed above one radiator and 
was meant to provide a proxy of water temperature in 
the heating systems. 

This gives a total of 8 temperature profiles. A preliminary 
inspection of the measured signals, reported in Fig. 6 (bot¬ 
tom panel) reveals the high level of collinearity which is 
well-known to complicate the estimation process in System 
Identification [6,30,49]. 

We only consider Multiple Input-Single Output (MISO) 
models, with the temperature from the first node as output 
(y,) and the other 7 temperatures as inputs {uj, j = 1,..,7). 
We leave identification of a full Multiple Input-Multiple 
Output (MIMO) model for future investigation. We split 
the available data into 2 parts; the first = 1000 tem¬ 
perature samples are the identification data while the last 
Nfgst = 1500 are used for test purposes. The notation y 
identifies the test data. Note that Njj = 1000, with 5 minute 


sampling times, corresponds to around ~ 80 hours', this is 
a rather small time interval and, as such, models based on 
these data cannot capture seasonal variations. Consequently, 
in our experiments we assume a “stationary” environment 
and normalize the data so as to have zero mean and unit 
variance before identification is performed. 

We envision that model predictive based methodologies, 
(see [7] and the recent papers [53], [41], [15]), may be 
effective for these applications and, as such, we evaluate 
our new estimators based on their ability to predict future 
data. The predictive power of the model is measured for 
k-step-ahead prediction on test data, as the fit; 



Identification has been performed using ARMAX+Or, Han- 
kel and iTS. More specifically, ARMAX+Or exploits AR- 
MAX models formed by polynomials of the same order. It 
has access to the test set and selects that model order which 
maximizes In particular, it turns out that setting 

the order to 7 provides the best average fit on an horizon of 
length around 8 hours. 

As for the two regularized estimators, consider ARX models 
of the form 

7 

X' = (g'oy),'+ 

7=1 

where the {g^} are the 8 unknown one-step ahead predictor 
impulse responses. Then, both Hankel and iTS assume the 
form 

N ( 1 . . 8 . 

argmin^ y,-- (g'Oy),.- ^(g^+'O m^),- 

e 1=1 V 7=1 / 7=1 

( 8 . 2 ) 

differing only in the adopted J. Both these estimators are 
then implemented in the same way described in the previous 
section. 

The 3 estimators are compared using as performance indexes 
defined in (8.1). The results are reported in Fig. 7 (left 
panel): similarly to what happened using simulated data, the 
performance of ARMAX+Or and iTS is similar and superior 
than that of Hankel. Sample trajectories of one-hour-ahead 
test data prediction obtained by iTS anzd Hankel are also 
visible in Fig. 7 (right panel). 

9 Conclusions 

The results of this paper highlight the importance of 
Bayesian interpretation of ReLS for system identification. 
In particular, the Bayesian framework offers transparent 
guidelines for selecting regularizers that capture crucial 
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One-hour ahead prediction 



in practice since it exploits the knowledge of the test set to tune model complexity. 


system features. Another use regards the assessment of 
existing regularizers: for instance, the drawbacks of the 
Hankel nuclear norm have been linked to the adoption of 
a Bayesian prior which models the impulse response as a 
nonstationary white noise. 

Similar drawbacks affect also other recent approaches that 
model the impulse response as the superimposition of a 
large set of atoms, employing an atomic norm as regular- 
izer. In the literature, it has been argued that any reasonable 
penalty function should be constant on the set of atoms [9]. 
Actually, this can result in penalty functions with poor ca¬ 
pability of controlling model complexity, thus leading to 
estimators with large variance. Indeed, if the number of 
atoms to be included tends to infinity, all of them being 
e.g. of unit £[ norm, any regularizer including smoothness 
and system stability can not assign the same penalty to the 
atoms. Including smoothness and stability constraints in the 
estimation process, e.g. using stable spline kernels, instead 
leads to a kind of regularizer whose weights are non uni¬ 
form. Advantages are confirmed by the simulation studies: 
in the Monte Carlo experiments the family of regularizers 
induced by stable kernels, including the three novel covari¬ 
ances iTC, iSS and iTS, performs systematically better than 
other nuclear and atomic techniques. 

In conclusion, we also stress that, even if drawbacks of the 
Hankel norm have been here illustrated, this does not mean 
that such regularizer can not be useful for system identi¬ 
fication. For example, the MIMO case has not been well 
investigated yet and there can be also cases where large 
magnitude corruptions or un-modelled dynamics can be well 
described by Hankel or weighted Hankel norms [33,44]. 
An interesting perspective is also the design of estimators 
that combine stable kernels and atomic norms. Preliminary 
work on this can be found in [12,40]. 


Appendix 

9.1 Hankel nuclear norm prior: approximation and 
MCMC reconstruction 

Our aim is to design an MCMC scheme able to reconstruct 
in sampled form the prior 


p^(g)ocexp 




associated to the Hankel nuclear norm regularizer. The key 
point is the definition of a proposal density leading to an 
efficient Metropolis-Hastings update, e.g. see [22]. For these 
purposes, it is useful to introduce the novel regularizer 


i 

According to the Bayesian interpretation of regularization, 
the associated prior is 


P//(^)°=exp 




= exp 




(9.1) 


where is the trace of . The last equality, to¬ 

gether with simple calculations, leads to the following result. 


Proposition 9.1 Let g G K™ andH G where m — 2p — 
1. If g is a random vector with pdf Pnig)’ ifte gj^ 

are independent and Gaussian. In particular, one has 


8k 


^ m-i+l 


if 

if < k ^ TO 


(9.2) 
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Thus, pfj describes the impulse response coefficients as 
white noise whose variance first decreases until k = 
and then increases, a stochastic process whose realizations 
are hardly similar to those of a stable system. Note also that 
the choice of the dimension m of g has an important influ¬ 
ence on the prior shape as the minimum value is reached for 

u _ m+1 

K — 2 ■ 

Coming back to the original Hankel prior, we have ex¬ 
ploited the prior pfj to generate a Markov chain converg¬ 
ing to p//(g) exp setting 2A = l,g G G 

j^50x50 particular, results displayed in Fig. 1 and dis¬ 
cussed in subsection 4.1, have been obtained generating a 
chain of length le6 by a random walk Metropolis scheme. 
More specifically, when the state chain is g , the proposed 
sample is generated as = g^ where all the are 
i.i.d. random vectors drawn from p^. Then, with probabil¬ 
ity min /Ph{s^)^ the Markov chain state 

is set to otherwise = g^. The left panel of Fig. 1 
also shows the standard deviations of the impulse response 
coefficients g^ under the approximated prior p^ character¬ 
ized by (9.1) (dashed line, scaled so that the variances of gj 
under pfj and pfj are equal). The similarity between p^ and 
Pfj confirms that the bad performance of the nuclear norm 
regularizer is due to a prior which shares the same flaws 
pointed out by (9.2). 

9.2 Proof of Proposition 9.1 

We start discussing the functional nature of the problem (5.7) 
using RKHS theory, then obtaining its Bayesian interpreta¬ 
tion. The main point of our proof is to exploit the RKHS 
representation in Theorem 4 on pag. 37 of [14] to connect 
the stable spline estimator with the atomic approaches. 

Just for a while, it is useful to reason in continuous-time 
and introduce the following Sobolev space [1] of functions 
/z: [0,1] ^ M 


h{0) =0, h abs. cont., J h^{t)dt < °° 

with (squared) norm \\h\iy = Jq k{t)^dt. It is well known 
that this is a RKHS with reproducing kernel which coincides 
with the covariance of the Brownian motion and is given, 
for f,s > 0 by 


S{sf) = min(s,t) = 2 ^ 
j=i 



(9.3) 

where C, =- - —j- Combining (9.3) and RKHS theory 

■> U^-7cI2) 

[14], can be also expressed as 


= Ih \ h(t) = ^ hjVlsin —= j f G [0,1], 

I 1=' \y^j) 



(9.4) 


Now, note that min(a^, a*) = Then, still using (9.3) 

we obtain 




2 EC; Sin 
./=i 




(9.5) 


which coincides with (5.4) when t and s are restricted to 
the set of natural numbers N. This expansion is key for our 
characterization. In fact, according to (9.5) and Theorem 4 
on pag. 37 of [14], the RKHS induced by the stable spline 
kernel with domain on N x N contains the following func¬ 
tions g : N —7- M: 


S I sit) = £ g V^sin [ ] ten, E 7^ < “• 

(9.6) 

But (9.4) and (9.6) also reveal that (which contains func¬ 
tions with domain [0,1]) and Jif (which contains functions 
of domain N) share the same atomic expansion coefficients 
gj and hj and are isometrically isomorphic. In view of the 
RKHS connection, the solution of (5.7) can be now obtained 
by the representer theorem for system identification. More 
specifically. Theorem 3 on pag. 671 of [39] and the con¬ 
nection between Bayes estimation of Gaussian processes re¬ 
ported in Sections 1.4 and 1.5 of [51] lead to (5.8) and this 
completes the proof. 
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